ftp.cs.arizona.edu

home *** CD-ROM | disk | FTP | other *** search

/ ftp.cs.arizona.edu / ftp.cs.arizona.edu.tar / ftp.cs.arizona.edu / icon / newsgrp / group01b.txt / 000188_icon-group-sender_Thu Dec 6 08:27:25 2001.msg < prev next >

Wrap

Internet Message Format | 2002-01-03 | 4KB

Return-Path: <icon-group-sender> Received: (from root@localhost) by baskerville.CS.Arizona.EDU (8.11.1/8.11.1) id fB6FQMU24132 for icon-group-addresses; Thu, 6 Dec 2001 08:26:22 -0700 (MST) Message-Id: <200112061526.fB6FQMU24132@baskerville.CS.Arizona.EDU> Date: Thu, 6 Dec 2001 01:18:29 -0600 (CST) From: John Paolillo <johnp@ling.uta.edu> To: eka@corp.cirrus.com, trutkin@physics.clarku.edu Subject: Re: Bio Informatics Cc: icon-group@cs.arizona.edu Errors-To: icon-group-errors@cs.arizona.edu Status: RO Content-Length: 3651 In formal-language theoretic terms, regular expressions allow the definition of finite-state transducers (FST's). These can be thought of as being described by ordinary finite state automata (FSA) with one difference, namely that they have two tapes, an input and an output tape, and state transitions can specify read-actions on the input tape and write actions on the output tape. Some backtracking is normally allowed. All the same, we're pretty far down the Chomsky hierarchy in terms of computational power, with a little bit more than the ability to recognize regular languages (the term used in the Computational Linguistics literature for what FST's compute is "regular relation"). Icon string scanning is indeed more powerful. It is very easy to write parsers that are context-free in their computational power, using no more than the implicit arguments (&subject, &pos, etc.) of the scanning environment. When explicit argument-passing is allowed, it becomes possible to write scanning procedures that are context-sensitive and Turing-equivalent in their computational power. So the relationship between Icon scanning and Perl regexp is that Icon is the more powerful and expressive of the two. This is both a "good" thing and a "bad" thing: scanning expressions that use the context-free or context sensitive expressivity of Icon's scanning expressions may have a worst-case exponential (i.e. intractable) complexity. FST- based scanning is generally more tractable. If built-in regexps are used for scanning, you are basically forced to adopt a strategy from a tractable set of computations. Not a bad thing if you are worried about performance, but frustrating if you need the missing expressivity for something. In computational linguistics, there is renewed interest in FSA and FST-based methods, on account of the tractability issues, even though it is generally recognized that something more (what A. Joshi called "weak context sensitivity") is probably needed to adequately describe human languages. For Bio-informatics, where the main tasks are matching and assembling long gene sequences, where we don't expect to find troubling context-sensitive dependencies (if I have such-and-such a thousand-base sequence, then I should expect an approximate copy of such-and-such another sequence over there...), FSA and FST methods are probably enough. (I doubt that anyone would have tried to use context free methods even, though I could well be wrong.) And since Perl comes plonked into just about any Linux or Unix installation these days, you can get going on your sequencing just by downloading a few scripts from someone's web or ftp site. I think that would be the main reason for people using Perl (rather than, say, Java...) Of course, one can easily write procedures for FSA and FST using Icon's string scanning (I suppose this is how the regexp library routines work?). This would be my preference, for purely idiosyncratic reasons. I just think that the design of the scanning environment in Icon is an elegant and brilliant piece of work. The use of the implicit variables is what I would consider a revolutionary concept in programming language design. Once you understand it, you almost never have to interact with it; it just seamlessly does its work for you while you attend to what is interesting in your program. How unlike Prolog DCG's (which I use a lot now) or even regexps. Besides, you never know when you'll want just a little bit more expressivity... Anyway, I hope this helps, John Paolillo SLIS and Informatics Indiana University http://ella.slis.indiana.edu/~paolillo/